Abstract

Background: Network motif algorithms have been a topic of research mainly since the seminal 2002 paper by Milo et al., which presented motifs as a way to uncover the basic building blocks of most networks. In bioinformatics, motifs have been applied mainly to gene regulation networks.

Results: This article proposes two new algorithms to exactly count isomorphic pattern motifs of size 3 and 4 in directed graphs. The algorithms are accelerated by combinatorial techniques. Let $G(V,E)$ be a directed graph with $m = |E|$. We describe an $O(m\sqrt{m})$ time algorithm to count isomorphic patterns of size 3. To count isomorphic patterns of size 4, we propose an $O(m^2)$ algorithm.

Conclusion: The new algorithms were implemented and compared with the Fanmod motif detection tool. The experiments show that our algorithms are considerably faster than Fanmod. We also make our motif detection tool available on the Internet.



Keywords: network motifs, complex networks, algorithm design and analysis, counting motifs, detecting motifs, motif discovery, motif isomorphism.

1 Background 

Network motifs, or simply motifs, correspond to small patterns that recurrently appear in a complex network [?]. They can be considered the basic building blocks of complex networks, and understanding them may be of interest in several areas, such as Bioinformatics [?], Communication [23] and Software Engineering [?].

Finding network motifs has attracted attention mainly since the seminal 2002 paper by Milo et al. [15], which proposed motifs as a way to uncover the structural design of complex networks. Nowadays, the design of efficient algorithms for network motif discovery is an active research area. Several surveys of motif detection algorithms have been published in recent years [?].

1.1 Problem statement 

This paper formally addresses the following two problems:

Problem 1.1 (Motifs-3) Given a directed graph $G(V,E)$, the problem Motifs-3 consists in counting the number of connected induced subgraphs of $G$ of size 3, grouped by the 13 distinct isomorphic graphs of size 3.

Problem 1.2 (Motifs-4) Given a directed graph $G(V,E)$, the problem Motifs-4 consists in counting the number of connected induced subgraphs of $G$ of size 4, grouped by the 199 distinct isomorphic graphs of size 4.

1.2 Related work and tools 

The algorithms for motif detection fall into two main approaches: exact counting and heuristic sampling. As these names suggest, the former performs a precise count of the frequency of each isomorphic pattern, while the latter uses statistics to estimate the frequency. Several exact search-based algorithms and tools can be found in the literature, such as MAVisto [20], NeMoFinder [?], Kavosh [?] and Grochow and Kellis [7]. Examples of sampling-based algorithms are MFinder [?, 16], Fanmod [57] and MODA [18].

In 2010, Marcus and Shavitt [13, 14] provided an exact $O(m^2)$ algorithm to count network motifs of size 4 in undirected graphs. In Section 2.4, Figure 3 shows the only six connected isomorphic patterns with 4 vertices, which can be labeled as: tailed triangle, 4-cycle, 4-cycle with chord, 4-clique, 4-path and claw. In fact, that paper provided six independent algorithms, each one devoted to counting one undirected isomorphic pattern. Some ideas of Marcus and Shavitt are present in our approach; however, this paper also provides a solution for the directed graph case.

1.3 Our approach: combinatorial acceleration 

This paper presents faster exact algorithms for the Motifs-3 and Motifs-4 problems. Two main techniques are applied to improve the computational complexity: first, our algorithms compute the number of occurrences of each isomorphic pattern instead of listing induced subgraphs; second, our method does not check isomorphism at all. The algorithms associate an integer variable with each isomorphic pattern and increment it directly.

Our algorithm was evaluated on biological transcription networks (the bacterium E. coli and the yeast S. cerevisiae) and on public dataset networks with up to 13,000 vertices and 100,000 edges. The results show a significant improvement in performance for Motifs-3 and Motifs-4. We believe that the technique can be extended to detect motifs of larger sizes.

The program was implemented in Java and is made available as freeware at http://www.luismeira.com.br/motifs.



1.4 Paper organization 

The remainder of this paper is organized as follows: Section 2 describes the implemented algorithms; to this end, Section 2.1 presents the notation used. Section 2.2 introduces the new approach, starting with the simple case of counting isomorphic patterns of size 3, and Sections 2.4 and 2.5 show the method applied to counting isomorphic patterns of size 4 in undirected and directed graphs, respectively. Section 3 presents the computational results, in comparison with another well-known tool. Finally, Section 4 presents the conclusion and our view of future work.

2 Implementation 

The existing exact algorithms to find network motifs are generally extremely costly in terms of CPU time and memory consumption, and they restrict the size of the motifs [?]. According to Ciriello and Guerra [5], motif algorithms typically consist of three steps: (i) list the connected subgraphs of $k$ vertices in the original graph and in a set of randomized graphs; (ii) group them into isomorphic classes; and (iii) determine the statistical significance of the isomorphic subgraph classes by comparing their frequencies to those of an ensemble of random graphs. The core of this paper focuses on items (i) and (ii).

Section 2.1 presents the notation used, Section 2.2 describes the algorithm for the Motifs-3 problem, and Sections 2.4 and 2.5 describe the algorithms for the Motifs-4 problem.

2.1 Notation and definitions 

Let $G(V,E)$ be a directed graph with $n = |V|$ vertices and $m = |E|$ edges. Assume that $m > n - 1$. If $(u,v) \in E$ and $(v,u) \in E$, we say it is a bidirected edge. Alternatively, if only $(u,v) \in E$, we say it is a directed edge.

Given a vertex $v \in V$, we partition the neighbors of $v$ into three disjoint sets, $\delta^*(v)$, $\delta^+(v)$ and $\delta^-(v)$, as follows:

$$u \in \begin{cases} \delta^*(v), & \text{if } (u,v) \in E \text{ and } (v,u) \in E \\ \delta^+(v), & \text{if } (v,u) \in E \text{ and } (u,v) \notin E \\ \delta^-(v), & \text{if } (u,v) \in E \text{ and } (v,u) \notin E \end{cases}$$

This means that $\delta^*(v)$ contains the vertices with a bidirected edge to $v$. The vertices reached by edges directed from $v$ are in $\delta^+(v)$, and the vertices with edges directed to $v$ are in $\delta^-(v)$.

Sometimes, for convenience, we consider an undirected version of $G(V,E)$ called $G^*(V,E^*)$, where $\{u,v\} \in E^*$ if and only if $(u,v) \in E$ or $(v,u) \in E$, or both. That is, we replace each directed or bidirected edge of $G$ by a single undirected edge in $G^*$. Let us define $\delta(v)$ as the set of neighbors of $v$; thus $u \in \delta(v)$ if and only if $\{u,v\} \in E^*$. Note that $\delta(v) = \delta^*(v) \cup \delta^+(v) \cup \delta^-(v)$.

Given two disjoint sets $A \subset V$ and $B \subset V$, we define $\delta^*(A,B)$ as the set of bidirected edges between $A$ and $B$, and $\delta^+(A,B)$ as the set of directed edges from $A$ to $B$. We also define $\delta^*(A,A)$ for a single set $A$ as the bidirected edges $(u,v)$ with $u \in A$ and $v \in A$, and $\delta^+(A,A)$ as the directed edges with $u \in A$ and $v \in A$.

2.2 Counting isomorphic patterns of size 3 

To simplify notation, Table 1 defines a set of auxiliary variables related to a vertex $v$.

The symbol $A^v = \delta^*(v)$ represents the set of bidirected neighbors of $v$, $B^v = \delta^+(v)$ is the set of directed neighbors from $v$, and $C^v = \delta^-(v)$ is the set of directed neighbors to $v$. The sets $A^v$, $B^v$ and $C^v$ define a partition of the vertices adjacent to $v$. For simplicity of notation, if the vertex $v$ is clear, the superscript $v$ can be omitted.

The number of bidirected neighbors of $v$ is given by $n^v_a$. The number of directed edges from $A^v$ to $B^v$ is $m^v_{a,b}$. The notation $m^*$ is used to represent the number of bidirected edges; for example, $m^{*v}_{a,b}$ is the number of bidirected edges between $A^v$ and $B^v$, and $m^{*v}_{a,a}$ is the number of bidirected edges inside $A^v$.

The algorithm to count motifs of size 3 is derived from Theorem 2.2. To better understand it, the following definition is needed:

Definition 2.1 (v-Patterns) Given a directed graph $G(V,E)$, we define v-Patterns, for any $v \in V$, as the set of induced subgraphs with three vertices $\{v,x,y\}$, where $x$ and $y$ are in $\delta(v)$; that is, all induced subgraphs containing the vertex $v$ and two more vertices adjacent to it. The same definition is valid for undirected graphs.

To illustrate the combinatorial optimization technique used in this paper, let us start by analysing a simple case. Consider a star graph $G^s(V^s,E^s)$ with center $v_c$ and neighbor sets $A^{v_c}$, $B^{v_c}$ and $C^{v_c}$, as described in Figure 1.

A naive motif counting algorithm would enumerate all pairs of vertices in $\delta(v_c)$. We argue that it is possible to compute the isomorphic pattern frequencies in $G^s$ in constant time $O(1)$, provided that the auxiliary variables of Table 1 have been precomputed.

Figure 1: Star graph and its sets. Isomorphic pattern frequencies at right.

Figure 1 gives an insight into how our algorithm works. It shows, in a simple example, that it is possible to count isomorphic patterns without explicitly listing all of them. The right side of the figure depicts all possible patterns and their occurrence frequencies in the star graph $G^s$. It can be seen, for instance, that there are exactly $n^{v_c}_a n^{v_c}_c$ occurrences of the pattern "$\circ \leftrightarrow \circ \leftarrow \circ$", that is, a pattern whose center vertex is linked to its left neighbor by a bidirected edge and to its right neighbor by a directed edge pointing to the center.
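As a minimal sketch of this constant-time count, the Python function below computes the six pattern frequencies of a star graph from the three neighbor counts alone. The function name and the dictionary keys, which label each pattern by the sets containing its two leaves, are illustrative assumptions and not part of the original tool.

```python
from math import comb

def star_pattern_counts(n_a, n_b, n_c):
    """Frequencies of the six v-Patterns of a star graph.

    n_a, n_b, n_c: numbers of bidirected, outgoing and incoming
    neighbors of the center. No edges exist among the leaves.
    Keys name the sets that contain the two leaves of the pattern.
    """
    return {
        ("A", "A"): comb(n_a, 2),   # two bidirected leaves
        ("B", "B"): comb(n_b, 2),   # two outgoing leaves
        ("C", "C"): comb(n_c, 2),   # two incoming leaves
        ("A", "B"): n_a * n_b,
        ("A", "C"): n_a * n_c,
        ("B", "C"): n_b * n_c,
    }
```

The six values always add up to the number of leaf pairs, since every pair of leaves forms exactly one pattern with the center.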

The algorithm to count isomorphic patterns in a general graph derives from the following theorem.

Theorem 2.2 Let $G(V,E)$ be a general graph and $v$ any vertex in $V$. The pattern occurrences in the set v-Patterns are given by Table 2.

Proof. This theorem is proved by induction. Observe that the v-Patterns set considers the patterns containing $v$ and two vertices in $\delta(v)$. Let $G'(V',E')$ be the graph $G$ induced by $\{v\} \cup \delta(v)$. The base case is when $G'$ is a star graph. In this case, the v-Patterns frequencies are equal to those on the right of Figure 1. Table 2 corresponds to it if all $m^v$ variables are zero, which is the case in $G'$.

Suppose that a new directed edge $(x,y)$ is added to $G'$, where $x$ and $y$ are in $\delta(v)$. The new graph has edges $E' \cup \{(x,y)\}$. At this moment, our sole interest is in subgraph patterns that contain the vertex $v$.

The number of patterns "$\circ \rightarrow \circ \rightarrow \circ$" in the original $G'$ is $n_c n_b$. If the new directed edge $(x,y)$ has $x \in C$ and $y \in B$, one pattern "$\circ \rightarrow \circ \rightarrow \circ$" is removed and a cyclic pattern is added. The added pattern is shown in Table 2 in the line with a positive $m_{c,b}$. If $x \in A$ and $y \in A$, one pattern "$\circ \leftrightarrow \circ \leftrightarrow \circ$" is removed and one pattern is added. The added pattern is shown in Table 2 in the line with a positive $m_{a,a}$. For each edge added inside $\delta(v)$, one pattern containing $v$ is removed and another pattern containing $v$ is added.

There is a straightforward generalization to an arbitrary number of added edges. Suppose $m_{c,b}$ directed edges are added from $C$ to $B$. The number of "$\circ \rightarrow \circ \rightarrow \circ$" patterns decreases by $m_{c,b}$ units and exactly $m_{c,b}$ occurrences of a new pattern arise, which can be seen in Table 2 in the line with a positive $m_{c,b}$. $\blacksquare$

Thus, given a graph $G(V,E)$, for each vertex $v \in V$, it is possible to obtain the v-Patterns frequencies using Theorem 2.2. Table 2 shows these pattern frequencies (see the variable definitions in Table 1).

If the variables of Table 1 have been preprocessed, it is possible to calculate in $O(1)$ the frequencies of all isomorphic patterns of size 3 containing a vertex $v \in V$ and two of its neighbors.

Patterns containing $v$, a neighbor of $v$, and a non-neighbor of $v$ are ignored when processing $v$. Fortunately, valid patterns involving these vertices are counted at their center vertex at another moment. Patterns whose undirected version is $C_3$ are considered three times each; therefore, a simple correction must be applied: the final counter of each $C_3$-related isomorphic pattern must be divided by three to provide the correct value.

Algorithm 1 counts motif patterns of size 3. In fact, the algorithm does not perform any isomorphism matching. It creates a vector $h$ with thirteen integers and initializes them to zero. In this vector, each of the thirteen patterns is arbitrarily associated with a fixed position: one pattern with $h[0]$, another with $h[1]$, a third with $h[2]$, and so on. The algorithm computes the frequencies of Table 2 for each $v \in V$ and adds them directly to the vector $h$. The algorithm output is the vector $h$, containing the thirteen isomorphic pattern frequencies.

Input: Directed graph $G(V,E)$
Output: Histogram of the 13 isomorphic patterns for motifs of size 3
1 Create a histogram data structure to count isomorphic patterns
2 Calculate the variables of Table 1 for all vertices
3 foreach $v \in V$ do
4     Calculate the number of patterns involving vertex $v$ using the frequencies of Table 2
5     For each pattern, add this frequency counter to the histogram
6 end
7 if the undirected version of the pattern is the cycle graph $C_3$ then
8     Divide the frequency counter by 3
9 end
10 return the histogram

Algorithm 1: Count 3 Sized Patterns Algorithm

The complexity of the algorithm is dominated by Line 2, which computes the variables of Table 1. All operations in Algorithm 1, except Line 2, are $O(m)$. The next section shows how to compute Line 2 in $O(m\sqrt{m})$.
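Table 2 is not reproduced in this text, so the following Python sketch illustrates the same counting idea on the simpler undirected case, where only two connected patterns of size 3 exist: the path $P_3$ and the triangle $C_3$. For each vertex, the neighbor pairs are split into pairs joined by an edge (triangles) and pairs that are not (paths centered at that vertex), and the triangle total is divided by three. The function names and the adjacency-dictionary input are assumptions made for illustration, and vertices are assumed to be comparable (for example, integers).

```python
from math import comb

def edges_among_neighbors(adj):
    """For each vertex v, count the edges whose two endpoints are neighbors of v."""
    m = {v: 0 for v in adj}
    for x in adj:
        for y in adj[x]:
            if x < y:                          # visit each undirected edge once
                for v in adj[x] & adj[y]:      # v closes a triangle with (x, y)
                    m[v] += 1
    return m

def count_size3_undirected(adj):
    """Count induced P3 and C3 subgraphs without listing vertex triples."""
    m = edges_among_neighbors(adj)
    paths = 0
    triangles_times_3 = 0
    for v, nbrs in adj.items():
        paths += comb(len(nbrs), 2) - m[v]     # neighbor pairs not joined by an edge
        triangles_times_3 += m[v]              # each triangle is seen from 3 vertices
    return {"P3": paths, "C3": triangles_times_3 // 3}
```

On a triangle with one pendant vertex, the sketch reports one $C_3$ and two $P_3$ patterns, both centered at the vertex that carries the pendant edge.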

2.3 Preprocessing Table 1

This section argues that, given a directed graph $G(V,E)$, it is possible to compute the sets and variables of Table 1 in $O(a(G)m)$, where $a(G)$ is the arboricity of the undirected version of $G$. Arboricity was introduced by Nash-Williams [17]; the arboricity $a(G)$ of a graph $G$ is the minimum number of forests into which its edges can be partitioned. It is known [5] that $a(G) = O(\sqrt{m})$ for any graph, so the time complexity is also $O(m\sqrt{m})$.

First, for each vertex $v \in V$, create three sets $A^v$, $B^v$ and $C^v$. Algorithm 2 describes how to compute these sets and their sizes in $O(m)$.

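Input: Directed graph $G(V,E)$
Output: Sets $A^v$, $B^v$ and $C^v$ and variables $n^v_a$, $n^v_b$ and $n^v_c$ for all $v \in V$
1 foreach $v \in V$ do
2     $(A^v, B^v, C^v) \leftarrow (\emptyset, \emptyset, \emptyset)$
3 end
4 foreach bidirected $(u,v) \in E$ do
5     $A^u \leftarrow A^u \cup \{v\}$
6     $A^v \leftarrow A^v \cup \{u\}$
7 end
8 foreach directed $(u,v) \in E$ do
9     $B^u \leftarrow B^u \cup \{v\}$
10     $C^v \leftarrow C^v \cup \{u\}$
11 end
12 foreach $v \in V$ do
13     $(n^v_a, n^v_b, n^v_c) \leftarrow (|A^v|, |B^v|, |C^v|)$
14 end

Algorithm 2: Create $\{A^v, B^v, C^v\}$ variables.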
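The partition step can be sketched in Python as follows. This is only an illustration under stated assumptions: the graph is given as a list of ordered pairs, self-loops are skipped, and the function name is hypothetical. The sizes $n_a$, $n_b$ and $n_c$ are simply the lengths of the returned sets.

```python
def partition_neighbors(vertices, edges):
    """Split the neighbors of every vertex into the sets A, B and C.

    A[v]: neighbors joined to v by a bidirected edge
    B[v]: neighbors w with an edge v -> w only
    C[v]: neighbors w with an edge w -> v only
    """
    edge_set = set(edges)
    A = {v: set() for v in vertices}
    B = {v: set() for v in vertices}
    C = {v: set() for v in vertices}
    for (u, v) in edge_set:
        if u == v:
            continue                      # self-loops are ignored in this sketch
        if (v, u) in edge_set:            # both directions present: bidirected
            A[u].add(v)
            A[v].add(u)
        else:                             # only u -> v present: directed
            B[u].add(v)
            C[v].add(u)
    return A, B, C
```

Each edge is inspected once and every membership test on the edge set takes expected constant time, which is consistent with the linear bound stated for this step.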



In what follows we present an algorithm to compute the variables $\{m^v_{a,a}, \ldots, m^{*v}_{c,c}\}$. The complexity is dominated by Line 5, the step that lists all triangles in an undirected graph. If we use the Chiba and Nishizeki algorithm [5] to list all triangles, we obtain an $O(a(G)m)$ algorithm, where $a(G)$ is the arboricity of $G$.



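Input: Directed graph $G(V,E)$
Output: Variables $\{m^v_{a,a}, \ldots, m^{*v}_{c,c}\}$ for all $v \in V$
1 Let $G^*(V,E^*)$ be the undirected version of $G(V,E)$
2 foreach $v \in V$ do
3     All variables in $\{m^v_{a,a}, \ldots, m^{*v}_{c,c}\}$ start with zero
4 end
5 List all triangles of $G^*(V,E^*)$ and save them in variable $T$
6 foreach triangle $(v_1,v_2,v_3) \in T$ do
7     Let $(u,x,y) \leftarrow (v_1,v_2,v_3)$
8     Increment the variables given by the table below; then do the same for $(u,x,y) \leftarrow (v_2,v_1,v_3)$ and for $(u,x,y) \leftarrow (v_3,v_1,v_2)$
9 end

    If vertices                      | Var to increment if (x,y) is bidirected  | Var to increment if (x,y) is directed
    $x \in A^u$ and $y \in A^u$      | $m^{*u}_{a,a}$                           | $m^u_{a,a}$
    $x \in A^u$ and $y \in B^u$      | $m^{*u}_{a,b}$ and $m^{*u}_{b,a}$        | $m^u_{a,b}$
    $x \in A^u$ and $y \in C^u$      | $m^{*u}_{a,c}$ and $m^{*u}_{c,a}$        | $m^u_{a,c}$
    $x \in B^u$ and $y \in A^u$      | $m^{*u}_{b,a}$ and $m^{*u}_{a,b}$        | $m^u_{b,a}$
    $x \in B^u$ and $y \in B^u$      | $m^{*u}_{b,b}$                           | $m^u_{b,b}$
    $x \in B^u$ and $y \in C^u$      | $m^{*u}_{b,c}$ and $m^{*u}_{c,b}$        | $m^u_{b,c}$
    $x \in C^u$ and $y \in A^u$      | $m^{*u}_{c,a}$ and $m^{*u}_{a,c}$        | $m^u_{c,a}$
    $x \in C^u$ and $y \in B^u$      | $m^{*u}_{c,b}$ and $m^{*u}_{b,c}$        | $m^u_{c,b}$
    $x \in C^u$ and $y \in C^u$      | $m^{*u}_{c,c}$                           | $m^u_{c,c}$

Algorithm 3: Create $\{m^v_{a,a}, \ldots, m^{*v}_{c,c}\}$ variables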



Algorithm 3 is, in essence, processing all triangles in $G(V,E)$. Each increment of a variable $m^v_{\cdot,\cdot}$ is an operation on a triangle containing $v$ and two connected neighbors. We remark that there exist more straightforward implementations of Algorithm 3, but using the Chiba and Nishizeki algorithm [5] to list all triangles as a subroutine simplifies the complexity analysis.

Thus, it is possible to conclude that Algorithm 1, which solves the Motifs-3 problem, has $O(a(G)m)$ time complexity. The memory used by the algorithm is linear in the memory used to represent $G(V,E)$.
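The triangle listing step can be sketched as follows. This is not the Chiba and Nishizeki procedure itself but a common degree-ordering variant, in which each triangle is reported once from its lowest-ranked vertex; it is shown only to make Line 5 of Algorithm 3 concrete, and it is not tuned to match the stated bound exactly. The function name and the adjacency-dictionary input are assumptions, and vertices are assumed to be comparable.

```python
def list_triangles(adj):
    """List each triangle of an undirected graph exactly once.

    adj maps every vertex to the set of its neighbors.
    """
    # Rank vertices by (degree, label); ties are broken by the label.
    order = sorted(adj, key=lambda v: (len(adj[v]), v))
    rank = {v: i for i, v in enumerate(order)}
    # Keep, for every vertex, only the neighbors of higher rank.
    higher = {v: {w for w in adj[v] if rank[w] > rank[v]} for v in adj}
    triangles = []
    for u in adj:
        for v in higher[u]:
            # w must rank above both u and v, so the triple appears once.
            for w in higher[u] & higher[v]:
                triangles.append((u, v, w))
    return triangles
```

Each returned triple $(v_1, v_2, v_3)$ can then be fed to the three rotations used in Algorithm 3.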



2.4 Counting isomorphic patterns of size 4 in undirected graphs 

To present our solution to the Motifs-4 problem, let us start with an undirected version of the problem. The directed case involves more details and is considered in Section 2.5.

As in the previous section, we first need the following definition:

Definition 2.3 (e-Patterns) Given a directed graph $G(V,E)$, we define e-Patterns, for any $e \in E$, as the set of patterns with four vertices $\{u,v,v_1,v_2\}$, where $(u,v) = e$ and $v_1$ and $v_2$ are in $\delta(u) \cup \delta(v) \setminus \{u,v\}$. That is, an e-Pattern is a pattern containing the edge $e$ and two more vertices adjacent to it. The same definition applies to undirected graphs.

The approach to counting patterns with 4 vertices is to count the e-Patterns for all $e \in E$. Let us define $Z^e = \delta(u) \cap \delta(v) \setminus \{u,v\}$ as the vertices adjacent to both $u$ and $v$, $X^e = \delta(u) \setminus (Z^e \cup \{v\})$ as the vertices adjacent only to $u$, and $Y^e = \delta(v) \setminus (Z^e \cup \{u\})$ as the vertices adjacent only to $v$. If the edge $e$ is clear, it can be omitted from the superscript. We define $n^e_x$, $n^e_y$ and $n^e_z$ as the sizes $|X^e|$, $|Y^e|$ and $|Z^e|$, respectively. See Figure 2.

$C_k$ is the cycle graph with $k$ vertices, $S_k$ is the star graph with a center and $k$ leaves, and $K_k$ is the complete graph of size $k$. $K_k \setminus \{e\}$ is the complete graph $K_k$ without an arbitrary edge $e$. $P_k$ is the path graph with $k$ vertices.

Figure 2: Sets associated with $e = \{u,v\}$ and e-Patterns frequency of the graph at left.

The algorithm to count isomorphic patterns derives from the following theorem.

Theorem 2.4 Let $G(V,E)$ be a general undirected graph and $e = \{u,v\}$ any edge in $E$. The pattern occurrences in the set e-Patterns are given by Table 3.

Proof. This theorem is also proved by induction on the number of edges. Let $G'$ be the graph $G$ induced by $\{u,v\} \cup \delta(u) \cup \delta(v)$. The base case is when $G'$ is the graph in Figure 2. In this case, the e-Patterns frequencies are equal to those on the right of Figure 2. Table 3 corresponds to it if all $m^e$ variables are zero, which is the case in the considered graph.

If one edge $e'$ is added inside $X^e$, one e-Pattern $S_3$ has to be replaced by one e-Pattern $x \leftrightarrow C_3$ (the tailed triangle). The same applies to $Y^e$. If one edge is added inside $Z^e$, one e-Pattern $K_4 \setminus \{e\}$ has to be replaced by one e-Pattern $K_4$. If one edge is added between $X^e$ and $Y^e$, one e-Pattern $P_4$ needs to be removed and one e-Pattern $C_4$ must be added. If one edge is added between $X^e$ or $Y^e$ and $Z^e$, one e-Pattern $x \leftrightarrow C_3$ has to be deleted and one e-Pattern $K_4 \setminus \{e\}$ must be added. Let $m^e_{x,y}$ be the number of edges between the sets $X^e$ and $Y^e$. Similarly, consider the variables $m^e_{x,x}$, $m^e_{x,z}$, $m^e_{y,y}$, $m^e_{y,z}$ and $m^e_{z,z}$. Thus, Table 3 presents the e-Patterns frequencies for $e = \{u,v\}$. $\blacksquare$

The algorithm to count isomorphic patterns of size four sums the e-Patterns frequencies over all $e \in E$.

As in Section 2.2, a pattern containing $\{u,v\}$, a neighbor of $u$ or $v$, and a non-neighbor of both $u$ and $v$ is not considered in the e-Patterns set. This is the case for the pattern $P_4$ with a non-central edge, and for the pattern $x \leftrightarrow C_3$ with one of the edges of $C_3$. Fortunately, these patterns will be considered later for other edges. An induced subgraph pattern can be considered in several distinct e-Patterns sets, so the final histogram needs a small correction. If a pattern appears as an e-Pattern for $a$ of its edges, the final histogram result must be divided by $a$.

The following fact describes this situation.

Fact 2.5 Based on Definition 2.3 and Figure 3, the $C_4$ pattern is in the e-Patterns set of each of its four edges. $P_4$ is in the e-Patterns set only of its central edge. $S_3$ is in the e-Patterns set of each of its three edges. The tailed triangle is in the e-Patterns set of three of its four edges. $K_4 \setminus \{e\}$ is in the e-Patterns set of each of its five edges, and $K_4$ is in the e-Patterns set of each of its six edges.

Algorithm 4 counts subgraph patterns of size 4. The complexity is dominated by Line 2, the time to calculate the needed variables. All other steps are $O(m)$.

Figure 3: Patterns and their edges. The pattern is in the e-Patterns set of the edges in bold. See Definition 2.3.

Input: Undirected graph $G(V,E)$
Output: Histogram of the 6 isomorphic patterns for motifs of size 4
1 Create a histogram to count isomorphic patterns
2 foreach $e \in E$ do
3     Calculate the variables $n^e_x, n^e_y, n^e_z, m^e_{x,x}, m^e_{x,y}, m^e_{x,z}, m^e_{y,y}, m^e_{y,z}, m^e_{z,z}$
4 end
5 foreach $e \in E$ do
6     Calculate the frequency of e-Patterns using Table 3
7     For each pattern, add its frequency counter to the histogram
8 end
9 if the histogram pattern is (see Fact 2.5):
10     $x \leftrightarrow C_3$ or $S_3$: Divide the frequency counter by 3
11     $C_4$: Divide the frequency counter by 4
12     $K_4 \setminus \{e\}$: Divide the frequency counter by 5
13     $K_4$: Divide the frequency counter by 6
14 return the histogram

Algorithm 4: Count 4 Sized Patterns Algorithm

As in the previous section, no isomorphism matching algorithm is used. The histogram is represented by a vector $h$ with 6 positions. The algorithm associates each pattern with an arbitrary hard-coded position in the vector. For instance, the patterns $(P_4, x \leftrightarrow C_3, S_3, C_4, K_4 \setminus \{e\}, K_4)$ may be associated with $(h[0], h[1], h[2], h[3], h[4], h[5])$, respectively. To add the e-Patterns occurrences for a specific $e \in E$, it is sufficient to update the integer variables in the vector $h$ using the rule of Table 3. The algorithm output is the histogram vector containing the pattern frequencies.

Algorithm 5 computes the needed variables, as required by Line 3 of Algorithm 4. The algorithm has $O(m)$ complexity for each $e \in E$. The variables $n^e_x$, $n^e_y$ and $n^e_z$ are simpler, so their calculation is omitted. Note that checking whether $x \in Z^e$ for an arbitrary vertex $x \in V$ and edge $e = \{u,v\}$ can be done in $O(1)$ by checking whether $(x,u) \in E$ and $(x,v) \in E$. Checking whether $x \in X^e$ or $x \in Y^e$ is similar.

We can conclude that Algorithm 4 counts isomorphic pattern motifs of size 4 in $O(m^2)$ in an undirected graph $G(V,E)$. Moreover, the additional memory to store the variables is $\Theta(m)$.
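Table 3 is not reproduced in this text. The Python sketch below therefore uses frequency formulas reconstructed from the base case and the edge-replacement rules in the proof of Theorem 2.4, so they should be read as an assumption rather than as the authors' table. The function name, the pattern labels and the adjacency-dictionary input are also illustrative, and vertices are assumed to be comparable.

```python
from math import comb

def count_size4_undirected(adj):
    """Count connected induced 4-vertex patterns of an undirected graph."""
    edges = [(u, v) for u in adj for v in adj[u] if u < v]
    h = dict.fromkeys(["P4", "tailed", "S3", "C4", "K4-e", "K4"], 0)
    for u, v in edges:
        Z = (adj[u] & adj[v]) - {u, v}      # adjacent to both endpoints
        X = adj[u] - Z - {v}                # adjacent to u only
        Y = adj[v] - Z - {u}                # adjacent to v only

        def side(w):
            return "x" if w in X else "y" if w in Y else "z" if w in Z else None

        # One pass over all edges per edge e, in the spirit of Algorithm 5.
        m = dict.fromkeys(["xx", "xy", "xz", "yy", "yz", "zz"], 0)
        for a, b in edges:
            sa, sb = side(a), side(b)
            if sa and sb:
                m["".join(sorted(sa + sb))] += 1

        nx, ny, nz = len(X), len(Y), len(Z)
        h["S3"] += comb(nx, 2) + comb(ny, 2) - m["xx"] - m["yy"]
        h["P4"] += nx * ny - m["xy"]
        h["C4"] += m["xy"]
        h["tailed"] += (nx + ny) * nz - m["xz"] - m["yz"] + m["xx"] + m["yy"]
        h["K4-e"] += comb(nz, 2) - m["zz"] + m["xz"] + m["yz"]
        h["K4"] += m["zz"]
    # Correction of Fact 2.5: divide by the number of edges that see the pattern.
    for name, k in (("tailed", 3), ("S3", 3), ("C4", 4), ("K4-e", 5), ("K4", 6)):
        h[name] //= k
    return h
```

The inner pass over all edges mirrors the per-edge linear cost discussed above, which gives a quadratic total in the number of edges.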

2.5 Counting isomorphic patterns of size 4 in directed graphs 

No new concept is needed to extend the previous algorithm to the directed version. However, a large number of sets and variables have to be dealt with. In what follows, we present the variables and sets related to an edge $e$.

Considering an edge $e = (u,v)$, it is possible to define 15 sets associated with it:

$$\mathcal{T}^e = \{A^e_1, B^e_1, C^e_1, A^e_2, B^e_2, C^e_2, AA^e, AB^e, AC^e, BA^e, BB^e, BC^e, CA^e, CB^e, CC^e\}$$

defined as follows (see Figure 4): $AA^e \leftarrow A^u \cap A^v$, $AB^e \leftarrow A^u \cap B^v$, $AC^e \leftarrow A^u \cap C^v$, $BA^e \leftarrow B^u \cap A^v$, $BB^e \leftarrow B^u \cap B^v$, $BC^e \leftarrow B^u \cap C^v$, $CA^e \leftarrow C^u \cap A^v$, $CB^e \leftarrow C^u \cap B^v$ and $CC^e \leftarrow C^u \cap C^v$.



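Input: Undirected graph $G(V,E)$ and an edge $e \in E$
Output: Variables $\{m^e_{x,x}, \ldots, m^e_{z,z}\}$
1 All variables in $\{m^e_{x,x}, \ldots, m^e_{z,z}\}$ start with zero
2 foreach $(x,y) \in E$ do
3     Increment the variable given by the table below
4 end
5 return the computed variables

    If vertices                      | Var to increment
    $x \in X^e$ and $y \in X^e$      | $m^e_{x,x}$
    $x \in X^e$ and $y \in Y^e$      | $m^e_{x,y}$
    $x \in X^e$ and $y \in Z^e$      | $m^e_{x,z}$
    $x \in Y^e$ and $y \in Y^e$      | $m^e_{y,y}$
    $x \in Y^e$ and $y \in Z^e$      | $m^e_{y,z}$
    $x \in Z^e$ and $y \in Z^e$      | $m^e_{z,z}$

Algorithm 5: Create $\{m^e_{x,x}, \ldots, m^e_{z,z}\}$ variables for a given $e \in E$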




Figure 4: Fifteen sets associated with an edge $e = (u,v)$.



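We also have the sets $A^e_1 \leftarrow A^u \setminus (\delta(v) \cup \{v\})$, $B^e_1 \leftarrow B^u \setminus (\delta(v) \cup \{v\})$ and $C^e_1 \leftarrow C^u \setminus (\delta(v) \cup \{v\})$. Finally, we have $A^e_2 \leftarrow A^v \setminus (\delta(u) \cup \{u\})$, $B^e_2 \leftarrow B^v \setminus (\delta(u) \cup \{u\})$ and $C^e_2 \leftarrow C^v \setminus (\delta(u) \cup \{u\})$.

The sets in $\mathcal{T}^e$ form a partition of the vertices adjacent to $e$. For instance, a vertex $v_1 \in B^e_1$ belongs to an outgoing edge from $u$, and a vertex $v_2 \in C^e_2$ belongs to an incoming edge to $v$. A vertex $v_3 \in AB^e$ belongs to a bidirected edge with $u$ and an outgoing edge from $v$.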
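A minimal Python sketch of this partition is given below. It assigns a vertex to one of the fifteen sets by looking only at its edges to $u$ and to $v$. The function names and the string labels (for example `"AB"` or `"C1"`) are illustrative assumptions, and the graph is assumed to be stored as a set of ordered pairs.

```python
def relation(edge_set, w, x):
    """Relation of neighbor w to vertex x.

    Returns 'A' for a bidirected edge, 'B' for an edge x -> w only,
    'C' for an edge w -> x only, and None if w and x are not adjacent.
    """
    outgoing = (x, w) in edge_set
    incoming = (w, x) in edge_set
    if outgoing and incoming:
        return "A"
    if outgoing:
        return "B"
    if incoming:
        return "C"
    return None

def classify(edge_set, u, v, w):
    """Name of the set that contains w for the edge e = (u, v), or None."""
    if w in (u, v):
        return None
    ru = relation(edge_set, w, u)
    rv = relation(edge_set, w, v)
    if ru and rv:
        return ru + rv          # adjacent to both: AA, AB, ..., CC
    if ru:
        return ru + "1"         # adjacent to u only: A1, B1, C1
    if rv:
        return rv + "2"         # adjacent to v only: A2, B2, C2
    return None                 # not adjacent to e
```

Since each call needs at most four membership tests, classifying a vertex takes expected constant time when the edge set is hashed.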

Given a set $T \in \mathcal{T}^e$, we define $n^e_T$ as $|T|$. Given two sets $T_1, T_2 \in \mathcal{T}^e$, we define $m^e_{T_1,T_2}$ as the number of directed edges from $T_1$ to $T_2$ and $m^{*e}_{T_1,T_2}$ as the number of bidirected edges between $T_1$ and $T_2$. In other words, for all $T_1, T_2 \in \mathcal{T}^e$, $m^e_{T_1,T_2} \leftarrow |\delta^+(T_1,T_2)|$ and $m^{*e}_{T_1,T_2} \leftarrow |\delta^*(T_1,T_2)|$. Thus, if $T_1 = AA^e$ and $T_2 = AA^e$, then $m^e_{AA,AA}$ is the number of directed edges inside $AA^e$. If $T_1 = A^e_1$ and $T_2 = A^e_2$, then $m^{*e}_{A_1,A_2}$ is the number of bidirected edges between $A^e_1$ and $A^e_2$.

Preprocessing these variables is the core technique used to accelerate our algorithm. The variables are computed only once and then used to infer the occurrences of motifs.

Consider an edge $e = (u,v)$ and its neighbor sets; for each $e \in E$, the algorithm analyzes and counts the e-Patterns (see Definition 2.3). Patterns containing the edge $e$ and a vertex not linked to $e$ are ignored. Fortunately, all patterns are considered by at least one of their edges, as illustrated in Figure 3. Patterns that are considered in more than one e-Patterns set must be corrected at the end of the algorithm, as in the undirected case.

Consider a simple graph $G'$ given by $G(V,E)$ induced on $\{u,v\} \cup \delta(u) \cup \delta(v)$. Assume that there are no edges between $\delta(u)$ and $\delta(v)$. This graph is similar to the one in Figure 2. Note that each pattern in the set e-Patterns contains the vertices $\{u,v\}$ plus two more vertices $\{v_1,v_2\}$ in $\delta(u) \cup \delta(v)$.

Figure 5: Variations of matrix pattern$(T_1,T_2)$ for $T_1 = A^e_1$ and $T_2 = AA^e$.

Assume that $(u,v)$ is bidirected. To discover the pattern associated with $\{u,v,v_1,v_2\}$, it is sufficient to know which sets in $\mathcal{T}^e$ contain $v_1$ and $v_2$. For instance, if $v_1 \in A_1$ and $v_2 \in A_1$, the associated pattern is a bidirected $S_3$. If $v_1 \in AA^e$ and $v_2 \in AA^e$, the associated pattern is a bidirected $K_4 \setminus \{e\}$. Let pattern$(T_1,T_2)$, for all $T_1,T_2 \in \mathcal{T}^e$, be the pattern related to $\{u,v,v_1,v_2\}$ where $e = (u,v)$, $v_1 \in T_1$ and $v_2 \in T_2$. The pattern$(T_1,T_2)$ for all $T_1$ and $T_2$ is shown in Table 4.

The algorithm needs the following fact:

Fact 2.6 Let $G'$ be any graph containing a bidirected edge $e = (u,v)$ plus vertices adjacent to $(u,v)$. Assume there are no edges inside $\delta(u) \cup \delta(v)$. If we consider a pattern $\{u,v,v_1,v_2\}$ where $v_1,v_2$ belong to the same set $T \in \mathcal{T}^e$, there are $\binom{n_T}{2}$ occurrences of pattern$(T,T)$. If $v_1 \in T_1$ and $v_2 \in T_2$ for distinct $T_1,T_2 \in \mathcal{T}^e$, there are $n_{T_1} n_{T_2}$ occurrences of pattern$(T_1,T_2)$. More formally, the frequency of pattern $P$, freq$(P)$, containing $\{u,v\}$ in $G'$ is

$$\mathrm{freq}(P) = \sum_{T \in \mathcal{T}^e :\ \mathrm{pattern}(T,T) = P} \binom{n_T}{2} \;+ \sum_{T_1,T_2 \in \mathcal{T}^e :\ T_1 < T_2,\ \mathrm{pattern}(T_1,T_2) = P} n_{T_1} n_{T_2}$$



It is necessary to define variations of the matrix pattern$(T_1,T_2)$. If a directed edge $(v_1,v_2)$ is added in the adjacency of $(u,v)$, where $v_1 \in T_1$ and $v_2 \in T_2$, one pattern pattern$(T_1,T_2)$ is removed and one pattern is created. The created pattern is defined as pattern$^{\rightarrow}(T_1,T_2)$. If the edge $(v_1,v_2)$ is bidirected, the created pattern is defined as pattern$^{\leftrightarrow}(T_1,T_2)$. If $v_1 \in T_2$ and $v_2 \in T_1$, the created pattern is pattern$^{\leftarrow}(T_1,T_2)$. Figure 5 shows the patterns created when an edge is added between $T_1 = A_1$ and $T_2 = AA$. There is a straightforward generalization to other choices of $T_1$ and $T_2$.

The following theorem is used by the algorithm.

Theorem 2.7 Let $G(V,E)$ be a general directed graph and $e = (u,v)$ a bidirected edge in $E$. The pattern occurrences in the set e-Patterns are given by the following sum:

Start all pattern frequencies at zero.
foreach $T \in \mathcal{T}^e$ do
    Increase pattern$(T,T)$ occurrence by $\binom{n_T}{2} - m_{T,T} - m^*_{T,T}$
    Increase pattern$^{\rightarrow}(T,T)$ occurrence by $m_{T,T}$
    Increase pattern$^{\leftrightarrow}(T,T)$ occurrence by $m^*_{T,T}$
end
foreach $T_1,T_2 \in \mathcal{T}^e$, $T_1 < T_2$ do
    Increase pattern$(T_1,T_2)$ occurrence by $n_{T_1} n_{T_2} - m^*_{T_1,T_2} - m_{T_1,T_2} - m_{T_2,T_1}$
    Increase pattern$^{\rightarrow}(T_1,T_2)$ occurrence by $m_{T_1,T_2}$
    Increase pattern$^{\leftrightarrow}(T_1,T_2)$ occurrence by $m^*_{T_1,T_2}$
    Increase pattern$^{\leftarrow}(T_1,T_2)$ occurrence by $m_{T_2,T_1}$
end

Proof. This theorem is also proved by induction on the number of edges. Let $G'$ be the graph $G$ induced by $\{u,v\} \cup \delta(u) \cup \delta(v)$. Suppose also that $(u,v)$ is bidirected. The base case is when $G'$ does not contain any edge inside $\delta(u) \cup \delta(v)$. In this case, the e-Patterns frequencies are given, by construction, by Fact 2.6. The proposed sum is equal to Fact 2.6 if all $m^e$ variables are zero, which is the case in the considered graph.

If one directed edge $(v_1,v_2)$ is added inside a set $T \in \mathcal{T}^e$, one occurrence of pattern$(T,T)$ is removed and one occurrence of pattern$^{\rightarrow}(T,T)$ is added. If one bidirected edge $(v_1,v_2)$ is added inside $T \in \mathcal{T}^e$, one occurrence of pattern$(T,T)$ is removed and one occurrence of pattern$^{\leftrightarrow}(T,T)$ is added.

If one directed edge $(v_1,v_2)$ is added between two distinct sets $T_1,T_2 \in \mathcal{T}^e$, one occurrence of pattern$(T_1,T_2)$ is removed and one occurrence of pattern$^{\rightarrow}(T_1,T_2)$ is added. If the added edge is $(v_2,v_1)$, the incremented pattern occurrence is pattern$^{\leftarrow}(T_1,T_2)$. If $(v_1,v_2)$ is bidirected, the incremented occurrence is pattern$^{\leftrightarrow}(T_1,T_2)$. $\blacksquare$
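The sum in Theorem 2.7 can be sketched in Python as follows. Since the pattern matrices are not reproduced in this text, the sketch does not map entries to isomorphism classes; it records each entry under a symbolic key `(kind, T1, T2)`, where `kind` is one of `"base"`, `"fwd"`, `"bwd"` and `"bi"`. These keys, the function name and the dictionary-based inputs are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations
from math import comb

def e_pattern_frequencies(n, m_dir, m_bi):
    """Accumulate e-Patterns frequencies for one edge e.

    n[T]            : size of set T
    m_dir[(T1, T2)] : number of directed edges from T1 to T2
    m_bi[(T1, T2)]  : number of bidirected edges between T1 and T2,
                      stored with T1 <= T2
    Missing keys are treated as zero.
    """
    freq = Counter()
    for T in n:
        d = m_dir.get((T, T), 0)
        b = m_bi.get((T, T), 0)
        freq[("base", T, T)] += comb(n[T], 2) - d - b   # pairs with no added edge
        freq[("fwd", T, T)] += d                        # directed edge inside T
        freq[("bi", T, T)] += b                         # bidirected edge inside T
    for T1, T2 in combinations(sorted(n), 2):           # pairs with T1 < T2
        fwd = m_dir.get((T1, T2), 0)
        bwd = m_dir.get((T2, T1), 0)
        b = m_bi.get((T1, T2), 0)
        freq[("base", T1, T2)] += n[T1] * n[T2] - fwd - bwd - b
        freq[("fwd", T1, T2)] += fwd
        freq[("bwd", T1, T2)] += bwd
        freq[("bi", T1, T2)] += b
    return freq
```

A useful sanity check is that the frequencies add up to the number of pairs of vertices adjacent to $e$, because every such pair contributes exactly one pattern. In a full implementation, each symbolic key would be replaced by the hard-coded histogram position of its isomorphism class.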

If the edge $e = (u,v)$ is directed, the pattern associated with $\{u,v,v_1,v_2\}$ must replace the bidirected edge $(u,v)$ by a directed one. The new patterns for $v_1 \in T_1$ and $v_2 \in T_2$ are represented by pattern$'(T_1,T_2)$, pattern$'^{\rightarrow}(T_1,T_2)$, pattern$'^{\leftarrow}(T_1,T_2)$ and pattern$'^{\leftrightarrow}(T_1,T_2)$ instead of pattern$(T_1,T_2)$, pattern$^{\rightarrow}(T_1,T_2)$, pattern$^{\leftarrow}(T_1,T_2)$ and pattern$^{\leftrightarrow}(T_1,T_2)$. The results are the same for a directed edge.

Corollary 2.8 Let $G(V,E)$ be a general directed graph and $e = (u,v)$ a directed edge in $E$. The pattern occurrences in the set e-Patterns can be calculated analogously to Theorem 2.7, but using pattern$'(T_1,T_2)$, pattern$'^{\rightarrow}(T_1,T_2)$, pattern$'^{\leftarrow}(T_1,T_2)$ and pattern$'^{\leftrightarrow}(T_1,T_2)$ instead of pattern$(T_1,T_2)$, pattern$^{\rightarrow}(T_1,T_2)$, pattern$^{\leftarrow}(T_1,T_2)$ and pattern$^{\leftrightarrow}(T_1,T_2)$ for any $T_1,T_2 \in \mathcal{T}^e$.

Algorithm 6 counts patterns of size 4 by summing the e-Patterns for all $e \in E$ and applying a correction when the same induced subgraph is considered several times.

Input: Directed graph $G(V,E)$
Output: Histogram of the 199 isomorphic patterns for motifs of size 4
1 Create a histogram to count isomorphic patterns
2 foreach $e \in E$ do
3     Calculate the variables $n^e_X$ for all $X^e \in \mathcal{T}^e$
4     Calculate the variables $m^e_{X,Y}$ and $m^{*e}_{X,Y}$ for all $X^e, Y^e \in \mathcal{T}^e$
5 end
6 foreach $e \in E$ do
7     Calculate the frequency involving $e$ and two neighbors using Theorem 2.7 or Corollary 2.8. For each pattern, add this frequency counter to the histogram
8 end
9 if the pattern is related to:
10     $x \leftrightarrow C_3$ or $S_3$: Divide the frequency counter by 3
11     $C_4$: Divide the frequency counter by 4
12     $K_4 \setminus \{e\}$: Divide the frequency counter by 5
13     $K_4$: Divide the frequency counter by 6
14 return the histogram

Algorithm 6: Count 4 Sized Patterns Algorithm

As in the undirected case, no isomorphism matching is performed. The resulting histogram is represented by a vector of integers $h$ with 199 positions, one for each distinct isomorphic pattern of size 4. It is necessary to preprocess the matrices pattern$(\cdot,\cdot)$, pattern$^{\rightarrow}(\cdot,\cdot)$, etc., associating each pattern with an arbitrary position in $h$. For instance, we can assign pattern$(A_1,A_2)$ to $h[0]$. The patterns pattern$(A_1,A_1)$ and pattern$(A_2,A_2)$ are isomorphic, so both can be associated with $h[1]$. As a final result, each pattern in the matrices used must be hard-coded to a position in the histogram vector $h$, which will be the program output.

The complexity of the algorithm is dominated by Line 2, since all other lines are $O(m)$. We argue that an algorithm similar to Algorithm 5 in Section 2.4 can calculate the needed variables in $O(m^2)$. Thus, it is possible to conclude that the proposed algorithm is an $O(m^2)$ algorithm to calculate motifs of size 4 in a directed graph.

3 Results and Discussion 

This section compares the computational results of our algorithm, which we call acc-MOTIF (accelerated
Motif), with Fanmod [21]. We chose Fanmod because it is one of the fastest available motif finder
programs [2^ .

The instances were arbitrarily selected from a wide range of motif applications. They come from
open complex network databases such as the Pajek and Uri Alon datasets [1][3]. We preprocessed the
instances by removing vertices with zero neighbors.

The implemented algorithms are designed only for motifs of size 3 and 4. To ensure replicability and allow
better evaluation, we provide the tested input graphs and the Java bytecode of the implemented algorithms
at http://www.luismeira.com.br/motifsl



All tests were performed on an Intel i7-2600 at 3.40 GHz with 4 GB of RAM, using an implementation in
Java. We ran Fanmod with the full enumeration parameter. Thus, both Fanmod and acc-MOTIF
solve the same problem, which consists of counting all subgraphs of the selected size.

Table 5 shows the execution time of Fanmod and acc-MOTIF. Each algorithm is executed on the original
graph and on one hundred random graphs. The reported time is the average over the original and
the random graphs. This experiment considers only the execution time to enumerate all subgraphs.
The time to generate the random graphs is not included.

Each round consists of the subgraph enumeration in the original graph plus one hundred random graphs.
We repeated the execution for five rounds. Table 5 contains the average and the deviation of these five
measurements.
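
The aggregation over rounds can be illustrated with a short sketch. The text does not say how the deviation is defined, so the sample standard deviation used here is an assumption. The class `RoundStats` and the timing values in `main` are hypothetical and do not come from the experiments.

```java
final class RoundStats {

    /** Arithmetic mean of the per-round times. */
    static double mean(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    /** Sample standard deviation (divides by n - 1); the choice of estimator is an assumption. */
    static double sampleStdDev(double[] values) {
        double mu = mean(values);
        double squares = 0.0;
        for (double v : values) {
            squares += (v - mu) * (v - mu);
        }
        return Math.sqrt(squares / (values.length - 1));
    }

    public static void main(String[] args) {
        // Made-up average times per graph, one value per round.
        double[] rounds = {2.0, 4.0, 4.0, 4.0, 6.0};
        System.out.printf("%.3f +/- %.3f%n", mean(rounds), sampleStdDev(rounds));
    }
}
```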

We limited the CPU time to 7,200 s per graph for the sake of convenience. Table 5 shows that the
proposed algorithms were considerably faster than Fanmod in almost all tested instances.
In the instance Foldoc [5], with 109,092 edges, Fanmod spent 439 s/graph to enumerate subgraphs of size
4, while acc-MOTIF solved the same instance in 13 s/graph. In the instance Words E. [1], with 46,281 edges,
Fanmod spent 7,028 ms/graph to enumerate subgraphs of size 3, while acc-MOTIF solved the same instance in
46 ms/graph.
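
As a worked reading of the two instances quoted above, the ratio of Fanmod time to acc-MOTIF time gives an approximate speedup. The term "speedup" and the rounding are ours, and the values use the rounded figures from the text:

```latex
\[
\text{speedup}_{k=4}(\text{Foldoc}) = \frac{439\ \text{s}}{13\ \text{s}} \approx 33.8,
\qquad
\text{speedup}_{k=3}(\text{Words E.}) = \frac{7028\ \text{ms}}{46\ \text{ms}} \approx 152.8 .
\]
```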

4 Conclusion 

Two new exact algorithms were presented to count isomorphic pattern motifs using combinatorial techniques.
The algorithms have complexity $O(m\sqrt{m})$ to count isomorphic patterns of size 3 and $O(m^2)$ to count
isomorphic patterns of size 4. Computational results show that the proposed exact algorithms are considerably
faster than known techniques (e.g., Fanmod).

The next step in this research is to detail, based on the same combinatorial techniques, an algorithm
for motifs of size 5. It appears natural to extend the e-Patterns to X-Patterns, where the size of X
is larger than that of an edge e.

5 Availability and requirements 

Project name: acc-MOTIF - Accelerated Motif Detection Using Combinatorial Techniques 

Project home page: http://www.luismeira.com/motifs

Operating system(s): Platform independent 

Programming language: Java 

Other requirements: e.g., Java 1.6; 1 GB RAM.
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License: Freeware. 

Any restrictions to use by non-academics: None. 
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are involved in the Java implementation. 
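$A^v = \delta^{*}(v)$
$B^v = \delta^{+}(v)$
$C^v = \delta^{-}(v)$

$n^v_a = |A^v|$
$n^v_b = |B^v|$
$n^v_c = |C^v|$

$m^v_{a,b} = |\delta^{+}(A^v, B^v)|$
$m^v_{a,c} = |\delta^{+}(A^v, C^v)|$
$m^v_{b,c} = |\delta^{+}(B^v, C^v)|$
$m^v_{b,a} = |\delta^{+}(B^v, A^v)|$
$m^v_{c,a} = |\delta^{+}(C^v, A^v)|$
$m^v_{c,b} = |\delta^{+}(C^v, B^v)|$
$m^v_{a,a} = |\delta^{+}(A^v, A^v)|$
$m^v_{b,b} = |\delta^{+}(B^v, B^v)|$
$m^v_{c,c} = |\delta^{+}(C^v, C^v)|$

$m'^v_{a,b} = m'^v_{b,a} = |\delta^{*}(A^v, B^v)|$
$m'^v_{a,c} = m'^v_{c,a} = |\delta^{*}(A^v, C^v)|$
$m'^v_{b,c} = m'^v_{c,b} = |\delta^{*}(B^v, C^v)|$
$m'^v_{a,a} = |\delta^{*}(A^v, A^v)|$
$m'^v_{b,b} = |\delta^{*}(B^v, B^v)|$
$m'^v_{c,c} = |\delta^{*}(C^v, C^v)|$

Table 1: Variables for vertex v.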



Table 2: Isomorphic pattern frequencies involving vertex v and two neighbors.



Table 3: e-Patterns frequencies for e = {m, w}
Rows and columns: A1, B1, C1, A2, B2, C2, AA, AB, AC, BA, BB, BC, CA, CB, CC

Table 4: If the top blue edge is ignored, this table represents pattern(T1, T2) for all $T_1, T_2 \in T^e$.
Symmetric side omitted.







                                    Motifs k = 3 (milliseconds)    Motifs k = 4 (seconds)
Instance         (n, m)             acc-MOTIF     Fanmod           acc-MOTIF         Fanmod
E. coli [5]      (418, 519)         0.9 ± 0.1     2.7 ± 0.2        0.021 ± 0.001     0.08 ± 0.003
Yeast [1]        (688, 1079)        0.9 ± 0.04    7.4 ± 0.6        0.043 ± 0.0004    0.19 ± 0.002
Roget [3]        (1022, 5074)       2.0 ± 0.04    34 ± 0.5         0.270 ± 0.010     0.76 ± 0.01
Csphd [5]        (1882, 1740)       1.2 ± 0.03    3.2 ± 0.2        0.055 ± 0.001     0.04 ± 0.0005
Epa [3]          (4271, 8965)       2.7 ± 0.3     131 ± 1          0.58 ± 0.01       9.2 ± 0.07
California [3]   (6175, 16150)      4.3 ± 0.1     216 ± 2          1.2 ± 0.01        12.6 ± 0.02
ODLIS [3]        (2900, 18241)      8.1 ± 0.3     1,025 ± 5        4.5 ± 0.03        210 ± 2
Words E. [1]     (7381, 46281)      46 ± 0.4      7,028 ± 174      105 ± 0.7         >7200
PairsFSG [3]     (5018, 63608)      42 ± 0.3      1,687 ± 19       13 ± 0.4          153 ± 3
Foldoc [5]       (12905, 109092)    92 ± 1        2,938 ± 7        13.3 ± 0.5        439 ± 8

Table 5: Execution time to count isomorphic patterns of size 3 and 4 per processed graph using acc-MOTIF
and Fanmod.
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